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Abstract 

Adaptive Metropolis-Hastings samplers use information obtained from previous draws 
to tune the proposal distribution automatically and repeatedly. Adaptation needs to 
be done carefully to ensure convergence to the correct target distribution because the 
resulting chain is not Markovian. We construct an adaptive independent Metropolis- 
Hastings sampler that uses a mixture of normals as a proposal distribution. To take 
full advantage of the potential of adaptive sampling our algorithm updates the mixture 
of normals frequently, starting early in the chain. The algorithm is built for speed and 
reliability and its sampling performance is evaluated with real and simulated examples. 
Our article outlines conditions for adaptive sampling to hold and gives a readily ac- 
cessible proof that under these conditions the sampling scheme generates iterates that 
converge to the target distribution. 

Keywords: Clustering; Gibbs sampling; Markov chain Monte Carlo; Semiparamet- 
ric regression models; State space models. 



1 Introduction 

Bayesian methods using Markov chain Monte Carlo (MCMC) simulation have greatly in- 
fluenced statistical and econometric practice over the past fifteen years because of their 
ability to estimate complex models and produce finite sample inference. A key component 
in implementing MCMC simulation is the Metropolis-Hastings (MH) method (Metropolis 
et al. 1953; Hastings 1970), which requires the specification of one or more proposal distri- 
butions. The speed at which the chain converges to the posterior distribution and its ability 
to move efficiently across the state space depend crucially on whether the proposal distribu- 
tion(s) provide good approximations to the target distributions, either locally or globally. 
Given the key role played by proposal distributions, it is natural to use experience from 
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preliminary runs to tune or adapt the proposal to the target. We define strict adaptation 
as adaptation that is subject to theoretical rules which ensure that the iterates converge to 
realizations from the correct target (posterior) distribution. All other adaptation of the MH 
kernel will be called preliminary adaptation, whose purpose is either to obtain an adequate 
proposal density before switching to non-adaptive MCMC sampling, or as the starting point 
for strict adaptive sampling. 

Despite the different theoretical requirements of preliminary and strict adaptation, a 
great deal of care is needed in constructing both types of adaptive sampling schemes because 
the ultimate goal is to obtain reliable estimates of functional of the target distribution as 
quickly as possible. 

The literature on adaptive MCMC methods follows two main strands. Adaptation by 
regeneration stems from the work of Gilks et al. (1998). Our article focuses exclusively on 
diminishing adaptation schemes. Important theoretical advances in diminishing adaptation 
were made by Holden (1998), Haario et al. (2001), Andrieu and Robert (2001), Andrieu 
and Mouhnes (2006), Andrieu et al. (2005), Atchade and Rosenthal (2005), Nott and Kohn 
(2005) and Roberts and Rosenthal (2007). The proofs of convergence for strict adaptive 
sampling arc more complex than for the non adaptive case as the iterates arc not Marko- 
vian because the MH kernel can depend on the entire history of the draws. Although more 
theoretical work on adaptive sampling can be expected, the existing body of results pro- 
vides sufficient justification and guidelines to build adaptive MH samplers for challenging 
problems. 

Research is now needed on how to design efficient and reliable adaptive samplers for 
broad classes of problems. This more applied literature mostly focuses on random walk 
Metropolis, see for example Roberts and Rosenthal (2006). Partial exceptions are Gasemyr 
(2003) who uses normal proposals for an independent Metropolis-Hastings, but limits the 
tuning of the parameters to the burn-in period, and Hastie (2005), who mixes random walk 
and independent MH in reversible jump problems, in what we call two step adaptation in 
section 2. Independent MH schemes are implemented by Nott and Kohn (2005) to sample 
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discrete state spaces in variable selection problems (e.g. to learn if a variable is in or out), 
and by Giordani and Kohn (2008) to learn about interventions, such as breaks or outliers, 
in dynamic mixture models. 

Our paper contributes to the development of both preliminary and strict algorithms 
for adaptive independent MH sampling in continuous state spaces. Increased sampling ef- 
ficiency is obviously one important goal, particularly in cases where current best practice 
(typically some version of random walk Metropolis or Gibbs sampling) is less than satisfac- 
tory. But equally important achievements of adaptive schemes may be to expand the set of 
problems that can be handled efficiently by general purpose samplers and to reduce coding 
effort. In particular, adaptive schemes can reduce dependence on the use of conjugate pri- 
ors. Such priors make it easier to implement MCMC schemes, but are less necessary when 
using adaptive sampling. 

Our adaptive sampling approach is built on four main ideas. The first is to combine 
preliminary and strict adaptation into one estimation procedure. The second is to estimate 
mixtures of Gaussians from the history of the draws and use them as proposal distribu- 
tions for independent MH in both parts of the adaptation. The third is to perform this 
estimation frequently, particularly during the preliminary adaptation stage, a strategy we 
call intensive adaptation. The fourth is to ensure that the theoretical conditions for the 
correct ergodic behavior of the sampler are respected during strict adaptation. To apply 
these ideas successfully, estimation of the mixture parameters needs to be fast, reliable, and 
robust. We achieve a good balance of these goals by carefully selecting and tailoring to our 
needs algorithms developed in the clustering literature. 

We study the performance of our adaptive sampler in three examples in which commonly 
used Gibbs schemes can be very inefficient and compare it with an adaptive random walk 
Metropolis sampler proposed by Roberts and Rosenthal (2006) that builds on the work . 

Our paper also provides conditions and outlines a proof that our strict adaptive sampling 
scheme converges to the correct target distribution and gives convergence rates. 
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2 Two-step adaptation and intensive adaptation 

A necessary condition for a successful adaptive independent Metropolis-Hastings (AIMH) 
sampler is that, given a sizable sample drawn from the target 7r{z), the suggested algo- 
rithm can build a proposal q{z) which is sufficiently close to the target for IMH to perform 
adequately. A two-step adaptive strategy is also feasible whenever the answer is positive. 
We loosely define two-step adaptation as a sampling scheme in which a rather thorough 
exploration of the target density is carried out in the first part of the chain by a sampler 
other than IMH (such as random walk Metropolis) before switching to a more efficient IMH 
sampler with proposal density tuned on the first-stage draws. An early version of such a 
two-step procedure is proposed by Gelman and Rubin (1992). Hastie (2005) provides an 
interesting application to reversible jump problems. 

Two-step adaptation is relatively simple and safe and in some cases is capable of achiev- 
ing sizable efficiency gains. However, it has the following limitations. First, if the first 
stage sampler fails to adequately explore a region of the state space, the proposal built 
for the second stage will also inadequately cover that region. To reduce this risk we may 
need a very large number of iterations in the first phase, which may be particularly time 
consuming if the first stage sampler is inefficient. Second, there may be no saving of coding 
effort if the first stage sampler generates from several conditional distributions, as in Gibbs 
or Metropolis- within-Gibbs, in order to be efficient. 

We loosely define intensive adaptation as an AIMH scheme in which the proposal dis- 
tribution is updated frequently, particularly in the early part of the chain. Building a 
sequence of increasingly good proposal densities in intensive adaptation is more demanding 
than building a proposal density once based on thousands of draws. The question is whether 
we can adequately explore the target distributions given an initial proposal go{z) but no 
draws. The answer inevitably depends on the initial proposal go{z), on the target 7r(z), 
and on the details of the sampling scheme. However, it is possible to outline some general 
dangers and opportunities offered by intensive adaptation. 

Among the advantages, if the proposal distribution is sufficiently flexible, frequent tuning 
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of its parameters and continuing adaptation for the entire length of the chain reduces the 
risk of a long run of rejections and increases the chances of good performance when the 
initial proposal approximates the target poorly. 

Estimating proposal densities based on a small number of draws presents some dangers 
that the designer of an AIMH scheme should try to minimize. For example, suppose that 
we predetermine the iteration, say j, at which the proposal is first updated. If the first j 
draws have all been rejected, then a proposal distribution based entirely on these draws is 
degenerate and makes the chain reducible. If too few draws have been accepted, the proposal 
may be very poor. We employ the following strategy to prevent these outcomes. First, we 
initially update the proposal at a predetermined number of accepted draws. Second, after 
fitting a mixture of normal distributions to past draws, we fatten and stretch its tails by 
creating a mixture of mixtures as described in section 3. Third, we let the proposal be 
a mixture where one component is the initial proposal go{z), which should of course have 
long tails. This is similar to the defensive mixtures approach studied by Hesterberg (1998) 
for importance sampling. The same provisions reduce the risk of adapting too quickly to a 
local mode. 

3 Some theory for adaptive sampling 

A diminishing adaptation MH sampler performs the accept/reject step like a standard MH, 
but updates the proposal distribution using the history of the draws. This updating is 
'diminishing' in the sense that the proposal distribution settles down asymptotically (in the 
number of iterations). 

This section outlines the theoretical framework for strict adaptive independent Metropolis- 
Hastings sampling as used in our article that gives some support for our practice. The 
appendix outlines proofs of the theoretical results, which extend similar results in Nott and 
Kohn (2005) for the case of a finite state space. Our aim is to provide simple accessible 
proofs that will help to popularize the adaptive methodology. All densities in this section 
are with respect to Lebesgue measure or counting measure, which we denote as 
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Let Z be a sample space and 7t{z) a target density on Z. We use the following adaptive 
MH scheme to construct a sequence of random variables {Zn,n > 2} whose distribution 
converges to tt{z). We assume that Zq and Zi are generated independently from some initial 
density qi{z). In our examples, this is a heavy tailed version of the Laplace approximation 
to the posterior distribution. For n > 1, let qn{z;Xn) be a proposal density for generating 
Zn+i, where is a parameter vector that is based on = zo, . . . , -^n-i = Zn-i- Thus, 
given Z„ = z, we generate ^n+i = z' from qn, and then with probability 

a„(.,.') = mi.(l.^^^) (1) 

V T^yz) qn{z'-Xn)J 

we take Zn+i = z'; otherwise we take Zn+i = z. Our choice of qn{z; A„) is 

qn{z; An) = UJigo{z) + (1 - uji)gn{z; Xn) (2) 

where < < 1 and the density go{z) does not depend on A„. We usually fix ui so that 
An = A^; otherwise A^ = (A^, wi). The density gn{z; Xn) is an approximation to 7r{z) whose 
form is discussed below, in section 5 and in appendix 1. We assume that there exists a 
constant K > such that for all z E Z 

^("^ < K, ^-I^M < K , and ^4 < K (3) 



9q{z) ' goiz) ' go{z) 



and 



sup 



9n{z] Xn) - gn+liz; Xn+l) / go{z) 



(4) 



where = 0{n~'^) for some r > almost surely. If Z is compact then (3) holds almost 
automatically. If, in addition. An is based on means and covariances of the z iterates then 
we can show that || An — An+i ||= 0{n~^) and equation (4) also holds. In relation to (3), 
we note that in the non-adaptive case, that is qn{z; An) = q{z) for all n, Mengersen and 
Tweedie (1996) show that 7r{z)/q{z) < K for all z is a necessary and sufficient condition 
for geometric ergodicity. 

Under conditions (3) and (4), the following results are proved in Appendix 2. 
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Theorem 1 For all measurable subsets A 



sup I Pr(Z„ e ^) - 7r(^) I ^ as n ^ oo. (5) 
A<zz 

Theorem 2 Suppose that h{z) is a measurable function that is square integrable with respect 
to the density go. Then, almost surely, 

1 " 

as oo. (6) 

We now describe the form of gn{z;Xn) in (2). For conciseness, we shall often omit 
showing dependence on A; e.g. we will write gniz;Xn) as gn{z). Let gn{z) = gni^'iKi) be 
a mixture of normals obtained using k-harmonic means clustering as described in section 5 
and appendix 1, and let ^J^(z) be a second mixture of normals having the same component 
weights and means as 5^(2;), but with its component variances inflated by a factor A; > 1. 
Let 

9n{^)=^'2~9l{z) + {l-JM^) , (7) 

where LO2 = 0^2/(1 — <^i) with uji defined in (2), loi > 0,lo2 > 0, and uji + UJ2 < 1. We note 
that g^iz) is also a mixture of normals, and we say that g^iz) is obtained by 'stretching 
and fattening' the tails of gn{z)- This strategy for obtaining heavier tailed mixtures is used 
extensively in our paper. 

To allow for diminishing adaptation, we introduce the sequence where < /3„ < 1, 
and define 

gn{z) = {l- I3n)9n{z) + /3n9n-l{z) (8) 

with gniz) = gn{z] An), where An is a function of A* and Alternatively, we can define 

~Xn = {l-Pn)Xn+Pn~Xn-l (9) 

and gn{z) = gn{z;Xn)- 
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If we restrict Z to be compact and let /?„ ^ 1 at an appropriate rate then it is straight- 
forward to check in most cases that the dominance and diminishing adaptation conditions, 
(3) and (4), hold. If Z is unconstrained but we restrict the A* to lie in a bounded set then 
we can do rough empirical checks that (3) and (4) hold by taking gQ{z) to be sufficiently 
heavy tailed. In our empirical examples we often find that we can take /?„ = for all n 
because A* converges to a A* at a sufficiently fast rate as n increases. This means that if 
g'o(-z) is sufficiently heavy tailed then (3) and (4) hold as n increases. 

Section 7 and appendix F of Andrieu and Moulincs (2006) give general convergence 
results for adaptive independent Metropolis-Hastings and Roberts and Rosenthal (2007) 
give an elegant proof of the convergence of adaptive sampling schemes. However, we believe 
that readers may find the conditions (3) and (4) and the proofs of Theorems 1 and 2 easier 
to understand for the methodology proposed in our article. 

4 Implementation of the adaptive sampling scheme 

This section outlines how our sampling scheme is implemented. We anticipate that readers 
will use this as a basis for their own experimentation. In the preliminary phase the density 
5o {.z) in (2) is a mixture of the Laplace approximation to the posterior and a heavier tailed 
version of the Laplace approximation, using weights of 0.6 and 0.4. By a heavier tailed 
version we mean a distribution with the same mean but with a covariance matrix that is 25 
times larger. If the Laplace approximation is unavailable, then we use the prior. At the end 
of the preliminary phase, g^iz) is constructed as a mixture of the last estimated mixture, 
which we call fl'iast(^) say, and a heavier tailed version of 5'iast(^)- That is, the component 
weights and means are set to those of (7iast(-2), and its component variances equal to /c = 25 
times the component variances of gy^siiz)- 

In our empirical work we set the weights ujx = 0.05, and UJ2 = 0.15 in (2) and (7). 
We also inflate the component variances of g^iz) by a factor of A; = 16 to obtain the 
corresponding variances of gniz)- These values have been found to work well but arc not 
optimal in any specific sense. We conjecture that the speed of convergence and efficiency of 
our sampler could be further improved with a more careful (and possibly adaptive) choice 
of these parameters. Any other value of k in the range 9-25 and of uji and UJ2 in the range 
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0.05 — 0.3 worked well for the examples given in this paper, and loi could be set to in the 
preliminary phase without affecting the results. 

In our empirical work, during the preliminary phase, when there are 2 to 4 unknown 
parameters as in the inflation and stochastic volatility examples, we first re-estimate the 
k-harmonic means mixture after 20 accepted draws in order to ensure that the estimated 
covariance matrices are positive definite. If our parameter space is bigger then we would 
increase that number appropriately. We then re-estimate the mixture after 50, 100, . . . , 
350, 400, 500, . . . , 1000, 1500, . . . , 3000 draws and then every 1000 draws thereafter. We 
also recommend updating the proposal following a period of low acceptance probabilities in 
the MH step. Specifically, we re-estimate the mixture parameters if the average acceptance 
probability in the last L iterations is lower than a^, where we set ai = 0.1 and L = 
10. Low acceptance probabilities signal a poor fit of the proposal, and it is therefore 
sensible to update the proposal to give it a better chance of covering the area around the 
current parameter value. Since it is unclear that this does not violate any of the conditions 
required for the ergodicity of our adaptive sampler, we limit the updating of the proposal 
at endogenously chosen points to a preliminary period, after which the proposal is updated 
only at predetermined intervals. The end of the preliminary adaptation period could be set 
ex-ante, but we prefer to determine it endogenously by requiring the smallest acceptance 
probability in the last M iterations to be higher than aM, where M is set to 500 and om 
to 0.02. During the period of strict adaptation, we update the proposal every 1000 draws. 

We conjecture that Theorems 1 and 2 will still hold if we update the proposal after 
a sequence of low acceptance probabilities so that we could also use this update strategy 
during the period of strict adaptation. However, we have not implemented this in our 
empirical analyses. 

The estimation of the mixture of normals can become slow when the number of iterations 
is large. To avoid this problem, after 1000 accepted draws we only add every j-th draw to 
the sample used to estimate the mixture parameters, where j is chosen so that the mixture 
is not estimated on more than 10000 observations. 

When most parameters are nearly normally distributed, fitting a mixture of normals 
to all the parameters is problematic in the sense that the chances of finding a local mode 
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with all parameters normally distributed is quite high (though depending on the starting 
value of course). This is true of clustering algorithms and also of EM-based algorithms. To 
improve the performance of the sampler in these situations, we divide the parameter vector 
9 into two groups, 9i and 62, where parameters in 61 are classified as approximately normal 
and parameters in 62 are skewed.^ A normal is then fit to the first group and a mixture 
of p normals to the second. For 9i, we can compute the mean and covariance matrix 
Eg^ inexpensively from the draws. For 62, we fit a mixture of normals as detailed below, 
estimating probabilities 7ri,...,7rp, means ^i, fip, and covariance matrices We 
then need to build a mixture for 6 = {9i,02}- The mean is straightforward: for the normal 
parameters, all components have the same mean. The diagonal blocks of the covariance 
matrices corresponding to var(^i) and var(02) for component i are also straightforward. 
The off-diagonal blocks of Oj, corresponding to cov(0i, 62) is computed as 

n n 

t=i t=i 

where 7r*j = Pr(if( = i|^2,t) is the probability of ^2,t coming from the i-th. component. 

5 A clustering algorithm for fast estimation of mixtures of 
normals in adaptive IMH 

Finite mixtures of normals are an attractive option to construct the proposal density because 
they can approximate any continuous density arbitrarily well and are fast to sample from 
and evaluate. See McLachlan and Peel (2000) for an extensive treatment of finite mixture 
models. 

However, estimating mixtures of normals is already a difficult problem when an inde- 
pendent and identically distributed sample from the target is given and estimation needs 
to be performed only once: the likelihood goes to infinity whenever a component has zero 

variance (an even more concrete possibility if, as unavoidable in IMH, some observations 

^Our rule of thumb is to place a parameter in the 'normal' group if its marginal distribution has |s| < 0.2, 
where s is the skeweness. Our fattening the tails of the mixture should handle the kurtosis, though this 
would optimally be done with mixtures of more flexible distributions than the normal. 
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appear more than once), and its maximization, whether by the EM algorithm or directly, 
is plagued by local modes. Although several authors have made substantial advances in 
dealing with these problems (e.g. Figuereido and Jain 2002; Ueda, Nakano, Ghahramani, 
and Hinton 2000; Verbeek, Vlassis, and Krose 2003), in our experience the EM algorithm 
does not seem to be sufficiently reliable when the sample is small and contains a non-trivial 
share of rejected draws. The inevitable short runs of rejections give rise to small clusters 
with zero covariance matrix at which the EM algorithm frequently gets stuck. Moreover, 
EM-based algorithm arc computationally expensive and slow to converge, which makes 
them less attractive when the proposal is to be updated frequently. 

We have therefore limited our attention to algorithms that estimate mixtures of normals 
quickly and without explicitly computing the covariance matrix of each component (for 
robustness). Within this class, the k-means is the most popular algorithm. We employ the 
k-harmonic means, an extension of the k-means algorithm that allows for soft membership. 
Degeneracies can be easily prevented, so the algorithm is remarkably robust even in the 
presence of a long series of rejections. The number of clusters is chosen with the BIG 
criterion. The increase in speed and reliability is paid for with a decreased fit to the 
target, as the family of k-mcans algorithms performs best when an optimal fit requires all 
components of the mixture to have the same probability and covariance matrix (see Bradly 
and Fayyad 1998, for a discussion). Hamerly and Elkan (2002) show that the performance of 
k-harmonic means deteriorates less rapidly than alternatives of similar computational cost 
with departures from these ideal conditions. An outline of the k-harmonic means algorithms 
is given in Appendix 1. 

Although the k-harmonic means algorithm is less sensitive to initialization than either 
k-means or EM (Hamerly and Elkan 2002) , in an unsupervised environment it is important 
to have good starting values. We have found the algorithm of Bradly and Fayyad (1998) to 
perform very well and at a low computational cost. 

If the proposal distribution is normal then it is computationally inexpensive to update 
it at every iteration. It is tempting to update a mixture of normals proposal with an 
on-line estimation procedure such as the on-line EM algorithm proposed in Andrieu and 
Moulines (2006) . The advantage of on-line estimation is that the parameters of the mixture 
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are updated recursively, so the proposal itself is updated at each iteration at a very small 
computational cost. However, on-line estimation of the mixture parameters in AIMH has 
a number of serious shortcomings. The estimates are inefficient compared to batch esti- 
mators because each data point is used only once, which corresponds to requiring a batch 
optimization to converge in one step. The loss of efficiency is more severe in small samples, 
that is, in the early phases of the chain. Direct estimation of the mixture component covari- 
ance matrices often leads to numerical problems in the early phases of the chain given that 
rejections in MH produce degenerate clusters. Finally, a limitation of on-line estimators 
is related to the fact that they are a form of stochastic gradient descent (see Spall (2003) 
for an introduction). When the function to be maximized is multimodal (as typically the 
case with mixtures) on-line estimates are in general sensitive to the order of the draws, 
with initial draws having heavier influence than later draws in determining the mode at 
which estimates settle down. We have verified empirically that the quality of solutions 
given by several on-line algorithms deteriorates rapidly if the initial observations are not 
representative of the entire target distribution. This makes on-line algorithms unsuitable 
for use in the early, exploratory phases of the chain, though they can work well if the initial 
proposal distribution already provides a reasonably good approximation of the target and 
the acceptance rates are sufficiently high. 

Since we are opting for batch estimators, it is too costly to update the proposal at each 
iteration. We update it at predetermined numbers of iterations, more frequently in the 
earlier stages of the chain. Implementation details for the empirical examples are given in 
section 4. 

We make two further comments on Andrieu and Moulines (2006). First, they propose 
to keep the number of components in the mixture constant, whereas we have found it 
advantageous to select the number adaptively as outlined in appendix 1. Second, they 
outline a proposed approach to using mixtures as proposal densities, but do not report on 
the empirical performance of their proposal. 
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6 Discussion 



In order to understand the strengths and hmitations of our sampler, we have found it useful 
to consider two desirable qualities of an adaptive IMH scheme. First, given a sufficiently 
large sample drawn from the target, we wish to construct a proposal density which fits the 
target as well as possible. This is an approximating ability: we want to draw an accurate 
'map' of the areas that we have already explored. Second, we wish the sampler to perform 
as well as possible when the initial proposal fails to cover part of the support of the target 
distribution. This is an exploring ability: when we propose in a region where our map is 
poor, we want to explore that region and quickly update our map. 

For example, using a normal proposal when the target is highly non-normal results in 
little approximating ability. Updating the proposal only once or very rarely results in little 
exploring ability, since the proposal reacts slowly or not at all to the information that it is 
fitting poorly at a given point. 

Our sampler has several characteristics designed to enhance its exploring ability. Fre- 
quent updating, particularly at early iterations, and updating following a sequence of low 
MH acceptance probabilities quicken the pace at which the proposal adapts to the infor- 
mation that it is not fitting well in a certain area. Long tails are useful not only to satisfy 
(3) and (4), but also to improve the chances of venturing into unexplored parts of the state 
space. Finally, mixtures are ideally suited for this exploration because a new component 
can be centered on a value causing a sequence of rejections. The long runs of rejections 
that can plague standard IMH are therefore much less likely using our AIMH sampling 
scheme because the proposal distribution is updated frequently and will accommodate the 
cluster of rejections by changing the mixture parameters or adding a new component. If 
our sampler makes a move in a region where the proposal fits poorly, it should therefore be 
able to explore it. Of course as the parameter dimension increases, if the initial proposal 
fails to cover a region we may never explore that region simply because the probability of 
making a proposal there is too small. 

The next example shows that in low dimensions we can often get away with a very poor 
initial proposal distribution. 
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Example 1 Suppose that the target is the univariate mixture 

tt{z) = 0.5(/>(z; 0, 1) + O.30(z; -3, 4) + O.20(z; 6, 0.5), 

and the initial proposal is 4>{z] —5,4). This proposal has very high importance weights ir/g 
in a large part of the support of z, but we still quickly converge to a good approximation of 
the target. Figure 1 shows that the acceptance rates increase fast initially and then stabilize 
as the proposal distribution also settles down. 

The next example shows that as the dimension increases a good initial proposal distri- 
bution becomes more valuable. 

Example 2 Consider the fifteen dimensional target distribution which is the mixture of two 
normals 

Tr{z) = 0.7(f){z; 0, /) + O.S(f>{z- /X2, 27), 

where (f){z; fj,, S) is a multivariate normal density with mean /x and covariance matrix S 
evaluated at z. The vector 112 has elements /X2,i = for i = 1,...,14, and /i2,i5 = —3. 
The first fourteen marginals are therefore symmetric but slightly leptokurtic, whereas the 
fifteenth is also skewed. The proposal distribution is initialized by fattening and stretching 
the tails of the Laplace approximation, that is, a normal distribution centered at the mode 
and with covariance set to minus the inverse of the Hessian of the log-likelihood at the mode. 
The Laplace approximation is nearly equal to (l){z; 0, /), so we have go{z) ~ 0.6(l){z; 0, /) + 
O.40(z; 0, 16/). Figure 2 shows that the acceptance rates at the initial proposal arc not high, 
but sufficient to start the learning process. The AIMH learns the marginal distribution 
of the non-normal variable rather well and, since most variables are normal, at very low 
computational cost since we only estimate the mixture parameters on variables that appear 
skewed. In contrast, an initial proposal such as (j){z;m,4:I), where m = —5 generates 
such low acceptance rates for this fifteen dimensional distribution that the learning process 
cannot get successfully started. 
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7 Applications 

State space models and nonparametric models are ideal initial applications for AIMH 
schemes. Although they can have a large number of parameters, it is often the case that, con- 
ditional on a small subset, most parameters can be integrated out or have known analytical 
form. Therefore it is often possible to draw all parameters in one or two blocks. Exploiting 
these features, it is also often inexpensive to find the posterior mode, possibly for a simpli- 
fied version of the model, and therefore obtain a reasonable initialization of the proposal 
distribution. Finally, the standard approach based on Gibbs and Metropolis- within- Gibbs 
can be very inefficient, particularly for state space models (see Fruhwirth-Schnatter 2004). 

For each of our applications we checked the results of the adaptive sampling scheme by 
re-running the sampler at a number of different starting points using a fixed proposal based 
on the last mixture in the strict adaptation phase. In all cases we got very similar results 
to those obtained using strict adaptation. 

For our examples we define the inefficiency of a sampling scheme as the factor by which 
the number of iterates would need to increase to give the same precision (standard error) as 
a sampler generating independent draws. For two sampling schemes A and B say, we define 
the inefficiency of scheme B relative to A as the factor by which it is necessary to increase 
the running time of B in order for it to obtain the same accuracy as A. It is computed as 
the inefficiency factor of B times its run time per iteration divided by the inefficiency factor 
of A times its run time per iteration. 

In the examples below we compare the performance of the AIMH sampler to the following 
version of the Haario et al. (2001) adaptive random walk Metropolis sampler proposed on 
page 3 of Roberts and Rosenthal (2006). Specifically, let 6 be the parameters in the model, 
9 the posterior mode and V the variance covariance matrix of the Laplace approximation 
to the posterior. Then at iteration j the proposal distribution is given by 



where N{9, V) is the normal density with mean 6 and covariance matrix V, 6^ is the current 



gj(r,-) = A^(^',(o.i)V/d) if 

Q,(r, •) = (!- P)N{9', {2.38)%/d) + PN{9'^, {d.lfh/d) if 



j > 5d, 



j < M, 
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value of ^, d is the dimension oi 9, (3 = 0.05 and is the current empirical estimate of the 
covariance matrix of the target distribution based on the iterates thus far. In all cases we 
initialized this sampler at the posterior mode. 

7.1 Time- varying parameter autoregressive models 

Consider the following time-varying parameter first order autoregressive (AR(1)) process 

(the extension to a more general autoregressive process is straightforward): 

yt = ct + PtUt-i + <7eet , ct = ct-i + XoaeUt and pt = pt-i + XiVt, (10) 

where et,ut,vt are all nid{0, 1). The model has three parameters (cr^, Aq, Af), while cq and 
Po can be treated either as parameters or (our choice) as states. Given conjugate priors 
(inverse gamma for the parameters, and normal for cq and po), Gibbs sampling is straight- 
forward (Carter and Kohn 1994). Fruhwirth-Schnatter (2004) reports that based on the 
autocorrelations of the iterates, Gibbs sampling can be very inefficient for these models. In 
the following application we also find that the Gibbs draws are highly autocorrelated and, 
by comparing posterior statistics from Gibbs sampling and from our AIMH, we also find 
that the autocorrelations do not reveal the full extent of the problem. 

7.1.1 Application: US CPI inflation 

We apply the model to quarterly U.S. CPI inflation for the period 1960-2005 (184 observa- 
tions).^ We use rather dispersed inverse gamma priors for cr^ , Aq, Xf with a common shape 
parameter of 1. The scale parameters are defined by setting the modes of the priors close 
to maximum likelihood estimates: CTQj^g for (where cr^^^ is the residual variance from 
an AR(1) model estimated by OLS), at O.OIctq^^ for Aq and at 0.001^ for Af . The modes of 
Aq and Af are centered at the maximum likelihood estimates to ensure that the bimodality 
in the posterior distribution of the log of Af documented in Figure 4 is not induced by the 
prior. 

^Annualized quarterly CPI inflation, defined as 4O0{Pt/Pt-i — 1), where Pt is aggregated from monthly 

data (averages) on Consumer Price Index For All Urban Consumers: All Items, seasonally adjusted, Series 
ID CPIAUCSL, Source: U.S. Department of Labor: Bureau of Labor Statistics. 
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For given parameters, the likelihood is easily computed via the Kalman filter. It is 
therefore simple to find the posterior mode, at which the chain is initialized. Posterior 
mode values suggest that time variation is nearly absent. 

Starting with Gibbs sampling, we draw 40 000 times after a burn-in of 5000. The re- 
cursive parameter means seem to settle down (not reported) and the posterior distributions 
are in line with a normal approximation taken at the mode, suggesting a persistent AR(1) 
with little sign of parameter variation (see Figure 3). It may therefore seem reasonable to 
assume that the chain has produced a sample representative of the entire posterior. 

However, the AIMH scheme tells a different story. The proposal is initialized at a mixture 
of two normals go{z) = 0.5(f){z;Ji,T,) + 0.5<^{z;fi, 16S), where /x is the posterior mode and 
— S is the inverse of the Hessian of the log-posterior evaluated at Jl. The AIMH soon 
discovers that the posterior distribution of log(Af), not to mention Af, is highly non-normal 
(see Figure 4), with substantial probability mass around a second mode corresponding to 
non-trivial amounts of time variation in pt and a lower pi. 

We also ran the adaptive random walk Metropolis sampler outlined at the start of the 
section. The sampler settles down to an acceptance rate of 20% and obtains the correct 
posterior distribution, and in particular finds both modes. Table 1 gives the inefficiency 
factors for all three samplers as well as the inefficiency factors of the Gibbs and ARWM 
relative to AIMH. The table shows that the AIMH sampler is appreciably more efficient 
than the other two samplers. 

7.2 Additive semipcirametric Gaussian models 

In this example we consider the additive semiparametric regression model with Gaussian 
errors, with some of the covariates entering linearly and the others entering more flexibly 



m 



H 




(11) 



h=l 
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the Si are nid{0, 1) and z is a vector of regressors that enter linearly. The Xh, h = 1, H 
are covariates that enter more flexibly by using the quadratic polynomial spline functions 



where x+ = x ii x > and otherwise and {xh,i, ■■■,Xh,k} points (or 'knots') on the 
abscissae of Xh such that min(x^) = Xh,i < ... < Xh,j < max(a;^). In this paper we choose 
30 knots so that each interval contains the same number of observed values of Xh- For a 
discussion of quadratic spline bases and other related bases see chapter 3 of Ruppert et al. 
(2003). We assume that a global intercept term is included in z in (11) and for simplicity 
we include the parameters Phfij h = 1, . . . , H in the vector 7 and x^, h = 1, . . . , H as part of 
the vector z. This transforms the nonparametric model into an highly parametrized linear 
model 



The prior for the linear parameters 7 is normal with a diagonal covariance matrix 7 ^ 
N{0, v^I), where Vj can be set to a large number. It is also convenient to assume a normal 
prior for the nonparametric part, with all parameters independent and Ph,j ~ -^(0) 7"^)) ) j = 
1, J, h = 1, H. However, with this prior there is a high risk of over-fitting if we simply 
set r| to a large number. The variance is often chosen by cross-validation, but in a fully 
Bayesian setting we can treat as a parameter. To illustrate the advantage of AIMH in 
working with different priors, we experiment with two options for the prior r^. The first 
prior is log-normal and rather dispersed: In(r^) ~ N{0,5^), the second is inverse gamma 
with shape parameter 1 and scale parameter implied by setting the mode at 0.1^. The 
prior for is inverse gamma with shape parameter one and scale parameter implied by 
setting the prior mode at the OLS residual variance estimated on (13). The prior for 7 = 
(7, Ph) is therefore jointly normal conditional on = {rf , .., r^}, 7|r ~ N{0, V^(r)), 

where V^(r) = diag(w^/, , I, . . . ,r|^/) is a block diagonal matrix. One way to estimate 
the posterior density of the semiparametric model is to use Gibbs or Mctropolis-within- 
Gibbs sampling as proposed by Wong and Kohn (1996). In this approach the parameters 



J 

fh{xh,i) = Po,hXh,i + ~ ^hj)^ = Po,hXi + gh{xh,i), 



(12) 



y = Zj + e. 



(13) 
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7 = {'J, Pi, Ph} are conjugate given 9 = {c^ , rf , r^}, and a"^ is conjugate given 7. 
Each variance r| can be updated with a Gibbs step for the inverse gamma prior, or with 
a Metropolis-Hastings step for the log-normal prior. In this second case, we use a Laplace 
approximation of p(ln(T^)|/3/j), which is very fast to compute using analytical derivatives. 
However, the correlation between and {Ph,i, ■■, l3h,j} could be quite high using either 
prior for r^. In addition, using a log normal prior for leads to high rejection rates in the 
Metropolis-Hastings step when generating the r^. Both problems are elegantly solved by 
integrating out 7 and generating ^ as a block using an efficient AIMH sampler. 

The next example shows how to update all parameters in one block with an efficient 
AIMH sampler. We first note that, conditional on 6, 7 can be integrated out, making it 
possible to compute p{0\y) oc piy\9)p{9), where y\9 ~ N{0, + ZV~(r)Z'). 

7.2.1 Application: Boston housing data 

We use the Gaussian semiparamctric model to study the Boston housing data introduced 
by Harrison and Rubinfield (1978) and analyzed semiparametrically by Smith and Kohn 
(1996). ^ There are 506 observations. The dependent variable is the log of MV, the 
median value of owner-occupied homes. We use all 13 available covariates (see Smith and 
Kohn or the web-site for a full description) in the linear part and the following six in the 
nonparametric part (Smith and Kohn use only the first five): X5 = NOX, nitrogen oxide 
concentration, Xq = RM, average number of rooms, Xg = DIS, logarithm of the distance 
from five employment centers, Xiq = TAX, property tax rate, X13 = STAT, proportion of 
the population that is lower status, Xi = CRIM, per capita crime rate by town. 

The proposal distribution for the seven parameters 9 = {In(cr^), ln(T|), In(T^)} is 
initialized by fattening the tails of the Laplace approximation. To find the Laplace ap- 
proximation, we simply apply Newton-Raphson optimization (with numerical derivatives) 
to \n.p{y\9) + \n.p{9), which involves no extra coding effort since both densities are needed 
to compute the MH acceptance ratio. Figure 5 provides results for the case of a log-normal 
prior on r^, h = 1,...,H and shows that the acceptance rate quickly improves and stabi- 
lizes at around 60% when all seven parameters are updated jointly. Most parameters are 

^The dataset is available at www.es. utoronto.ca/"" delve/data/boston. 
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approximately lognormally distributed, except those connected to the variables TAX and 
CRIM, which benefit from the added flexibility of mixtures. The correlation matrix of the 
smoothing parameters {ln(r|), ...,ln(r^)} is nearly diagonal. This suggests that the AIMH 
could handle large numbers of smoothing parameters efficiently by updating them in blocks 
(with a different proposal density estimated adaptively on each block), since the blocks 
would be nearly independent of each other. 

Table 1 reports the inefficiency factors for both the Gibbs sampler and the AIMH sampler 
for both inverse gamma and log normal priors, as well as the inefficiency of the Gibbs 
sampler relative to the AIMH sampler. The table shows that in terms of relative efficiency 
(defined at the beginning of section 7), the AIMH is about 40% more efficient than the Gibbs 
sampler when both samplers use the inverse gamma prior on r^, and nearly seven times 
more efficient when both samplers use the log-normal prior. Reported results are for the 
average inefficiency factors (over both h and i) of fh{xh,i)- Looking at the autocorrelation 
of the log-parameters gives similar inefficiency ratios. 

We also applied the adaptive random walk Metropolis sampler to this data set, but could 
not make it work well. With the sampler initialized at the posterior mode, the acceptance 
rate started at over 50%, but within a few hundred iterations fell to below 1% and stayed 
there indefinitely. We do not report any inefficiency factors for this sampler because we 
do not believe that inference is reliable with such a low acceptance rate. We conjecture 
that the poor behavior of the ARWM sampler in this example compared to the other two 
examples is because this example has 7 parameters whereas the other two have 3 and 2 
parameters. In addition, the second derivatives of the log posterior in this example are far 
from constant, so a unique covariance matrix may do very poorly. By contrast, a mixture 
of normals allows for local correlations between the parameter and therefore may be less 
affected. 

7.3 Stochastic volatility models 

The simplest stochastic volatility model can be written for mean corrected data as 

y^ = e^-^^*et , ht = fi + p{ht-i - n) + avt, (14) 
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Boston 


mean fhixh,i) 




Inflation 


\og{af) 


log(A^) 


log(A^) 


AIMH, IG 


2.6 




AIMH 


6.7 


2.8 


6.1 


Gibbs, IG 


6.3 (1.4) 




Gibbs 


9.4 (1.3) 


113.3 (37.4) 


156.4 (23.7) 


AIMH, LN 


1.6 (6.8) 




ARWM 


21.5 (3.1) 


23.5 (8.3) 


23.6 (3.8) 


M-Gibbs, LN 


18.4 













Table 1: Inefficiency factors for the semiparametric (Boston) and state space (inflation) 
models, together with the inefficiencies of the Gibbs sampler and the ARWM relative to the 
AIMH sampler in brackets. AIMH: adaptive independent Metropolis-Hastings; M-Gibbs: 
Metropolis- within- Gibbs; and ARWM: adaptive random walk Metropolis. IG and LN: 
inverse gamma and log-normal priors for the Boston data. 



where et is md(0, 1) and the model parameters are 6 = {jjL,p,a}. We square and take logs 
of the observation equation, and we approximate the distribution of In(e^), which is the log 
of a chi-squared 1, by a mixture of normals as in Kim et al. (1998). This model has a 
conditionally Gaussian state space form 

yt = g{Kt) + ht + G{Kt)ut , ht = 11 + p{ht-i - n) + avt, (15) 

where et is nid{0, 1), yt = ln{yf), and g{Kt) and G{Kt) are known given Kt. 

The indicators K can be sampled in one block given 9 and h as in Carter and Kohn 
(1994). The distribution of 9 given h is conjugate, but Kim et al. (1998) show that 9 and 
h are highly correlated and recommend drawing 9 given y and K but integrating h out. 
This is accomplished with a Metropolis-Hastings step, where p{y\K, 9) is computed via the 
Kalman filter. Since the posterior mode is not readily available, Kim et al. (1998) use IMH, 
where the proposal distribution is calibrated once from draws obtained with a less efficient 
sampling scheme. This is less efficient than our scheme and requires coding two different 
samplers. An alternative we now explore is to use AIMH from the beginning of the chain. 

7.3.1 Application: USD-GBP daily returns 

We analyze daily U.S. dollar - British pound returns (defined as the first difference of the log 
of the nominal exchange rate) for the period January 1990 to March 2004. The parameter 
/i can be integrated out (see Kim et al. (1998)). To initialize the proposal distribution, we 
approximate the distribution of In(ef) as a normal with mean —1.27 and variance 2.22^. 
This gives a standard Gaussian state space model, for which the Laplace approximation 
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is easily available. We also use the mode to center the priors for p and ln((7^), which are 
normal and dispersed. The prior for p is truncated at 1. With fattened tails, the initial 
proposal gives an acceptance rate of around 10%, and Figure 6 shows that it takes only a 
few hundred iterations for the acceptance rates to increase to around 50%. This number is 
satisfactory given that the proposal approximates p{0\y) while the acceptance probability 
is computed on p(9\y, K). 

The adaptive random walk Metropolis sampler also quickly settled down to an accep- 
tance rate of around 15%, with the chain mixing well. The inefficiency factors for p and 
(T^ for both samplers are given in table 2, together with the inefficiency factors for the 
adaptive random walk Metropolis relative to AIMH. The table shows that AIMH compares 
favourably with ARWM. 



Sampling Scheme 


P 


a' 


AIMH 
ARWM 


7.6 

21.1 (2.7) 


9.2 

26.5 (2.86) 



Table 2: Inefficiency factors for the stochastic volatility model for the adaptive independent 
Metropolis-Hastings (AIMH) sampler and the adaptive random walk Metropolis (ARWM) 
sampler, with the relative inefficiency of the AWRM in brackets. 

8 Conclusion 

This paper shows that it is possible to build adaptive independent Metropolis-Hastings 
samplers that can do better than two-step adaptation because they adapt throughout the 
sampling period. The most interesting applications arise when current best practice is 
inefficient or cumbersome and, in our opinion, when adaptation starts early. Our article 
provides a fast and reliable algorithm which performs well in three interesting models and 
compares favorably on these examples with a standard Markov chain Monte Carlo sampler 
and an adaptive random walk Metropolis sampler. 
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Appendix 1: k-harmonic means clustering 

We estimate the mixture of normal parameters using the k-harmonic means clustering 
algorithm which can be described as follows. (See Hamerly and Elkan 2002, for a discussion). 
Let p be the number of clusters. 

1. Initialize the algorithm with ci, ...,Cp, the component centers. The starting values are 
chosen with the procedure of Bradly and Fayyad (1998) . We depart slightly from 
Bradley and Fayyad in using the harmonic k-means algorithm (rather than k-means) 
in the initialization procedure. 
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2. For each data point 9t, compute a weight function w{Ot) and a membership function 
m{ci\9t) for t = 1, ...,p as 

'^i^t) = un TTZ;;^ and m(cj|6't) - 



where — Ci|| is the EucUdean or Mahalanobis distance. Fohowing Bradly and 
Fayyad (1998), we put a lower boundary e on \\6t — Cj|| (to avoid degeneracies when 
trying ||ci — Ci||). The membership function softens the sharp membership of the k- 
means algorithm, so one observation can belong to more than one cluster in differing 
degrees. The weight function gives more weight to observations that are currently 
covered poorly (i.e. that are far from the nearest center). 

3. Update each center Cj 

Et=im{ci\et)w{et)et 



Ci 



Y:tim{ci\et)w{et) 



4. Repeat until convergence. This gives the cluster centers, which we take as estimates 
of the component means. The other mixture parameters can then be estimated for 
1 — 1 , •••J k cis 

T.t=im{ci\et)wiet){9t - Ci){et - Cj)' , / |.^ 



J2t=iMci\&t)w{et) 



5. The number of clusters is chosen with the BIC criterion given a maximum number (5 
in our examples). 

We notice that the covariance matrices Vi are only estimated once, after convergence, 
k-means type algorithms also differ from the EM algorithm in that they do not evaluate 
the likelihood p(^|ci, tti^Fi, ...). This sub-optimal use of information in fact turns out 
to be a great advantage for our purposes. Fewer iterations than for EM are needed for 
convergence, and each iteration is faster. Even more importantly, the algorithm does not 
get stuck in the small degenerate clusters caused by rejections in the sense that, Tuilike for 
the EM algorithm with freely estimated covariances, these small clusters are not absorbing. 
If k-harmonic means does find a degenerate cluster, this causes no trouble for convergence, 
and after convergence we can use a predefined matrix in place of any non-positive-definite 
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covariance matrix (for example, if Vi is not positive definite we set it to 0.5^Var{9)). If 
desired, the mixture parameters can be refined with a few steps of the EM algorithm. In 
this case, we recommend not updating the the covariance matrices for the reasons just 
discussed. 

Appendix 2: Proofs 

The one-step transition kernel for Zn+i in section 3 is given by 

Tn{z, dz') = an{z, z')qn{z')ii{dz'} + 5,{dz'){l - Vn{z)) (16) 
where 5z{dz') = 1 if z € dz' and is otherwise, and 

Z^n(^) = 0!n{z, z')qn{z')lJ,{dz'}. (17) 

By the construction of the MH transition kernel, 

J Tr{z)Tn{z,dz')n{dz} = TT{z')n{dz'} . (18) 

In this section K is a generic constant, independent of n, z and z'. It is convenient to 
write hn{z; Xn) as hn{z) Without loss of generality we assume throughout this section that 
Z is a discrete space. Exactly the same proof goes through for the continuous case with 
summations replaced by integrals. We use the notation Zs-.t to mean {zg, ■ ■ ■ , zt} for s < t, 
with a similar interpretation for Zg-.t- 

To prove Theorem 1 we first obtain the following two lemmas. 

Lemma 1 Under the assumptions of Section 2, for any n,k > and z, z' G Z, 

(a) qn{z) < Kgo{z). 

(h) an{z,z')qn{z') <Kgo{z') 

(c) There exists an ei, < ei < 1, such that an{z, z')qn{z') > eiTr{z') for all z,z' G Z. 

(d) Un{z) > ei for all z G Z, where Vniz) is defined by (17). 
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(e) For k>\, let A„(z, z') = an{z, z')qn{z') - an+i{z, z')qn+\{z'). Then, 

I A„(z,zO I < K(^go{z') + ^5o(^))a„. (19) 

(f) 

\vn{z)-Un+iz)\<K{l + ^^an (20) 

Proof. (a) qn{z)/go{z) = u>i + {1 — u>i)gn{z) / go{z) and the result follows from (3). 
(b) follows from (a) and a„(z, z') < 1. To show (c), note that qn{z)/Tr{z) > u>igo{z)/Tr{z). 
Prom (3), there is an ei such that qn{z)/TT{z) > ei for all z £ Z. It is now straightforward 
to show that an{z, z')qn{z')/Tr{z') > ei for all z,z' G Z. (d) follows from 

To obtain (e), it is necessary to consider the following four cases. 

Case 1. an{z,z') = 1 and a„+i(z, z') = 1. Then, |A„| =| qn{z')-qn+iiz') \ < Kgo{z')an 
by (4). 

Case 2. an{z,z') < 1 and an+i{z,z') < 1. 

I A„(z,z') I = I qn{z) - qn+i{z) I < K^^^^go{z)an. 

tt[z) t^KZ) 

Case 3. an{z,z') = 1 and q;„+i(2;, z') < 1. In this case A„(2;, z') = qn{z')—Tr{z')qn+i{z)/Tr{z). 
If A„(z,z') > 0, then 

< A„(2;, z') < — ^ \qn{z) - qn+\{z) \ < Kgo{z)anCk ■ 

If An{z,z') < 0, then 

■k(z') 

< -An{z,z') = —-^qn+iiz) - qn{z') < qn+i{z') - qn{z') 
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Thus, 



Case 4. Uniz, z') < 1 and an+iiz, z') = 1. This case is similar to case 3. 
To obtain (f), we note that 

I T^n{z) - Vn+l{z) \ < ^ |A„(z,/)| , 

z' 

and the result follows from (e). ■ 

With ei as in Lemma 1, choose < e < ei and let 



(21) 



Then, Rn{z, z') is a one-step transition kernel with the following properties. 
Lemma 2 (a) 

Y,T^{z)Rn{z,z')=Tr{z') . 
z 

(b) 

Rn{z, z') < Kgo{Zn) + IjSziz') 

where < 77 < 1. 
(c) 

I Rn{z,z') - Rn+i{z,z') \ < Ka„| (go{z') + ^5o(^)) + (l + f^)<^^(^') 
(d) 



XI ■ ■ ■ XI n Rn-k{Zn-k, Zn-k+l) < Kgo{Zn) + v'^^Zn-miZn) 

Zn-m+l Zn-1 k=l 
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(e) For 1 < Z < i — 1 and j = 1, . . . ,n, 

i-i 

X] X] ^(^n-i+l) Y\. ^n-j{Zn-k,Zn-k+l) = T^{Zn-l) 

^n-l-l Zn-j+1 fe=l+l 

(f) Forj = l,...,nandl = l,...,j-l, 

Zn-j+l Zn-1 k=l+l 

X ( Rn-l{Zn-h Zn-l+l) - Rn-j{Zn-l, Zn-l+l) j Y\ ^n-kiZri-k, ^n-fc+l) I < K go{Zn)an-jij — 
^ ^ k=l 

Proof, (a) follows from (21) and (18). (b) follows from (21). (c) follows from (19) and 
(20). (d) is true for m = 1 and is obtained in general by induction, (e) follows from part (a), 
(f) follows from parts (a) to (e). ■ 

Proof of Theorem 1. Let Aj,j = 1,2, ... be an independent Bernoulli process such 
that Aj = 1 with probability e and Aj = with probability 1 — e. From (21), Tn{z,z') = 
(1 — e)Rn{z, z') + €n{z') so that we can interpret Tn{z, z') as a mixture of transition kernels, 
such that Tn{z, z') = Rn{z, z') if A„ = and Tn{z, z') = Tr{z') if A„ = 1. For j = 1, . . . , n, 
let Anj be the event that A^-j+i = 1, = 0,k = n — j + 2, . . . ,n. Let 5„ be the event 
that Aj = for j = 1, . . . , n. Then Pr(A„j) = e(l - e)^-^ and Pr(5„) = (1 - e)", and 

n 

Pr(Z„ = Zn) = Y, ^A^n = Zn \ Anj) Pr(^„j) + Pr(Z„ = Zn | 5„) Pr(5„). 

As in the proof of Theorem 1 in Nott and Kohn (2005), we can write Pv{Zn = Zn\An,j) = 
Co,j + Cij H h Cj-ij, where 

i-i 

Cl,j = X ■ ■ ■ X -P^(^0:n-j = Zo:n-j)n{Zn-j+l) JJ Rn-j{Zn-k, Zn-k+l) 

^0 Zn-l k=l+l 

{ Rn-l{Zn-l, Zn-l+l) - Rn-j{Zn-l, Zn-l+l) j Rn-k{Zn-k, ^n-fc+l) 
^ ^ fc=l 

From part (e) of Lemma 2, Cqj = Tr{zn) and by part (f) of Lemma 2, | Cj^n |< KgQ{zn)an-j{j— 
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I) for J > 1. Using a similar argument to that in Nott and Kohn (2005), this imphes that 



Thus, 



i-i 

1=0 



<7r{zn) + Kgo{zn){n-j) '''f 



XI = Zn\An,j) Pr(A„j) = 7r(zn) - (1 - e)"'Tr{zn) + rin where 



n-l 



rin\< Kn-''J2(^-i;) iMl-er'- 

3=1 



-ri 



We also have that 



n— 1 



Pl{Zn = Zn\ Bn) = ^---^ qi{zo)qi{zi) Rk{Zk, Zk+l) 

ZO Zn-1 k=l 

< Kgo{zn) + r]''^^qi{zn) < Kgo{zn) • 



using Lemma 2 (c) and (3) . Hence, 



Pl{Zn = Zn) - 7T{Zn) \ < Kgo{zn) (1 - e)" + n 



(22) 



The proof of Theorem 1 follows. ■ 

The proof of Theorem 2 is similar to that in Nott and Kohn (2005) if we use (22). 
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Figure 1: Example 1. Left panel: Proposal distribution (bold) after 15 000 iterations, 
initialized with a normal ^(z; — 5, 4). The target density (thin) is a univariate mixture 
0.5cl){z; 0, 1) + 0.3(f){z: —3, 4) + 0.2(j){z; 6, 0.5). Right panel: Recursive updates of the accep- 
tance rate in the last 500 iterations. 



true (thin) ond estimated (bold) pdf. first coefficient true (thin) and estimated (bold) pdf, lost coefficient 




X 10* 

Figure 2: Example 2. Proposal distribution after 35 000 iterations. The target distribution 
is a 15-dimensional mixture. The graph plots the true marginal distributions (thin) for the 
first and last variable together with the corresponding marginal distributions (bold) implied 
by the mixture of normals estimated on the full history of the draws, and with recursive 
updates of the acceptance rate in the last 500 iterations. 
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Ln(lambdoO) 



Ln(lombdo 1 ) 




Figure 3: Inference for a time varying parameter AR(1) model for US inflation by Gibbs 
sampling, (a) marginal distribution of In(Ao) (b) marginal distribution of In(Ai) (c) marginal 
distribution Ai (d) inflation plot and mean, estimated as E[(ct/1 — Pt)\y] (e) marginal 
distribution of po\y (f) E(ct\y) (bold line) and E{pt\y). 
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acceptance ratio, last min(lt.600) recursive estimates of In(tau) stde 




Figure 5: Inference for semiparametric model of housing prices by adaptive IMH. First 
row: recursive acceptance rate for the last min(it,500) iterations, recursive means of In(ri), 
marginal of a^. Second and third rows: marginals of In(Ti). Fourth and fifth rows: means of 

PiX + gi{x). 
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returns (y) Mean of exp(0.5*h) 




soaa 



rho acceptance rate moving window 




Figure 6: Inference for the daily US-GBP exchange rate by AIMH. (a) exchange rate returns 
(b) mean of 0.51n(/it) (c) marginal of (d) marginal of 0.5exp(//) (e) marginal of p (f) 
moving window of the acceptance rate for the last min(it,500) iterations. 
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